Reason Why

“At the beginning of FY2020, the Marketing, Comms, and Sales teams were challenged by a new business objective: High Velocity Merchants (HVM). After initial market research, we realized the potential of this new target and the difficulty of identifying these companies with the tools available at the time. As most of the information required (company name, investment stage, founders, events, associations, etc.) was openly accessible, we decided to use data analytics in our favor to create a project that could benefit the marketing and communications efforts with regard to lead generation, content topics, advertising strategies, brand awareness and events mapping.”

  • Estefanía Granados. Marketing Specialist.

“On top of the mountain of our ambitions, we are looking to make PayU, within the next 3 years, the number one payment company in the fast-growing markets, as well as the number one player in full financial services in specific regions, and to build our own ecosystem. Our goals are clear, and we know that we will only achieve this by focusing our efforts on innovation. What does it mean? This means making PayU a data-driven company and putting data at the core of our strategies. It means having a team that experiences a data culture and is not limited to its original area of knowledge. Through this project we are changing our mindset. We are taking Communication and Marketing out of the “support areas” position, and turning them into market intelligence entry doors. Here we are planting a seed of Communication and Marketing Intelligence that can guide leaders and teams in wiser business decisions. ‘Decoding’ who HVMs are and how they think, is just the beginning of what we can offer in terms of stronger targets, KPIs, and actions.”

  • Fabiana Paiva. Communication Manager Latam.

“This mapping exercise not only allows us to better understand the customer profiles we are looking for, but also gives us a concrete guideline to establish an effective communication and resource investment strategy in order to attract them.”

  • Ángela Bohorquez. SSC Marketing Manager.


Objectives

High Velocity Merchants (HVM) were initially conceived as companies valued at over 1 billion dollars that had not executed a merger or Initial Public Offering (IPO). They were usually known for having passed through multiple funding rounds and being at a late investment stage.

As the concept was relatively new and there was ambiguity about what it means to be “at a late investment stage”, we decided to map available companies at all investment stages.

This project aims to identify the behavioural and topic patterns of HVMs within the global ecosystem of entrepreneurship at an advanced investment level, through their points of interaction. Its main objectives are:

  • Identify who the HVMs are, where they are and which events they attend.
  • Identify the conversational topics they reveal on social media.

To pursue these objectives, Fidelio conducted digital research based on websites dedicated to gathering information about companies and events. Angel.co, Crunchbase, 10times and Twitter are the main data sources.

These data sources were used to consolidate a worksheet database with information about:

  • Companies
  • Founders
  • Venture Capitalists
  • Events
  • Institutions

Twitter was used as the main data source for social media. Its content served to model the latent topics across all of the HVMs’ conversations available online.


Methodology


  • Name of the project: HVM Ecosystem Mapping.
  • Data collection date: 15 Jun 2019 - 17 Aug 2019.
  • Company responsible for the study: FIDELIO DIGITAL S A S.
  • Company sponsoring the study: PAYU.
  • Objective group:
    • Companies in multiple investment stages.
    • Founders of those companies.
    • Venture capitalists investing in those companies.
    • Events worldwide.
    • Institutions organizing those events.
  • Sample design: targeted unweighted sampling.
  • Sample analysis: Cross Industry Standard Process for Data Mining (CRISP-DM).
  • Sample framework: angel.co, crunchbase.com, 10times.com, twitter.com.
  • Sample size:
    • 4,205 companies.
    • 5,540 founders.
    • 487 venture capitalists.
    • 8,080 events.
    • 2,155 institutions.
  • Data collection technique: web scraping.
  • Geographical scope: worldwide.
  • Error margin: does not apply.
  • Delivery report date: Aug 19, 2019.


This work has been developed thanks to open source technologies. Python: pandas, numpy, matplotlib, wordcloud, nltk, sklearn, collections, gensim, bokeh. R: knitr, readxl, data.table, plotly, RColorBrewer, DT, bubbles. JavaScript, HTML and CSS.

Glossary

  • Word cloud: An electronic image that shows words used in a particular piece of electronic text or series of texts. The words are different sizes according to how often they are used in the text. Each word cloud in this document shows the semantic root of every word: words were grouped by root to preserve consistency between them. For example, “developer”, “development” and “develop” all map to the same root word, “develop”. Source
  • Topic modelling: In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Source
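
The root-word grouping described in the word cloud entry can be sketched as follows. This is a deliberately simplified suffix stripper written for illustration only: the report lists nltk among its tools but does not state which stemmer was used, so the suffix list and the `simple_root` and `root_counts` helpers are assumptions, not the project's actual code.

```python
from collections import Counter

# Hypothetical suffix list, checked in order; a real stemmer has many more rules.
SUFFIXES = ("ment", "ing", "er", "s")


def simple_root(word: str) -> str:
    """Strip the first matching suffix, keeping at least four characters of root."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def root_counts(words):
    """Count occurrences per root so that word variants share one entry."""
    return Counter(simple_root(w) for w in words)


counts = root_counts(["developer", "development", "develop", "platform", "platforms"])
```

In a word cloud built from `counts`, the three variants of “develop” would contribute to a single, larger word instead of three smaller ones.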


Mapping Tables


Use the navigation bar on your left to step across the tables or select one of the links below:

  1. Companies. Total of 4,205 companies.
  2. Founders. Total of 5,540 founders.
  3. Venture capitalists. Total of 487 venture capitalists.
  4. Events & Institutions. Total of 8,080 events and 2,155 institutions.


Companies

4,377 companies were mapped, including 172 categorized as Closed in the “Operating Status” column. The closed companies were kept in the database but excluded from this analysis, leaving a total of 4,205.
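
As a minimal sketch of that exclusion step, assuming each company is a record with an “Operating Status” field (the sample rows and the `exclude_closed` helper are hypothetical; only the column name and the totals come from the report):

```python
# Hypothetical rows; only the "Operating Status" column name comes from the report.
companies = [
    {"name": "company_a", "Operating Status": "Active"},
    {"name": "company_b", "Operating Status": "Closed"},
    {"name": "company_c", "Operating Status": "Active"},
]


def exclude_closed(rows):
    """Keep every company whose operating status is not 'Closed'."""
    return [row for row in rows if row["Operating Status"] != "Closed"]


analysed = exclude_closed(companies)

# The same bookkeeping with the totals reported above.
mapped, closed = 4377, 172
remaining = mapped - closed
```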



Among the main words identified in the companies’ descriptions we find “platform”, “provid” (the root of “provider”), “onlin”, “develop” and “servic”. These words imply that the companies are mostly related to online platforms that provide services and solutions with technology and data, i.e. they are mainly digital companies.

Besides the columns shown in the table, we gathered 94 columns for each company. Among the most important are: Categories, Funding Status, IPO Status, Facebook, LinkedIn, Twitter, Contact Email, Phone Number, Description, Total Funding Amount, Number of Funding Rounds, Number of Investors, Number of Lead Investors, Number of Current Team Members, Number of Articles and Number of Events, among many others.


Verticals

In order to give better insights, all categories in this report were merged with PayU’s main verticals:

Additional categories were added in order to preserve consistency among companies and between industries.

352 (8.4%) companies did not have a category so they were taken out of this pie chart.

PayU’s main verticals represent 69% of the categories in the database. Most of them are Digital Services (28.2%), Direct Selling (17.5%) and Software (16.6%).

The categories with the highest average revenue are Aerospace ($653M USD), Digital Services ($471M USD) and Fintech ($169M USD). Since Aerospace does not have a significant number of companies (10), the representative category with the highest average sales is Digital Services.

With regard to monthly visitors, Software is significantly higher than the rest, with an average of 67 million monthly visits. The top companies are YouTube (24 billion), Quora (589 million) and Zhihu (286 million).

Government and NGOs have the highest number of tech products. However, the number of companies in those categories is not representative (11 and 4). Among the remaining categories, the one with the most tech products is Uber Model Sharing (31).


Location - Country

84.3% of the companies mapped are located in the United States (75.2%), United Kingdom (4%), China (3%) and India (2.1%).

For each country, we mapped the average annual USD revenue, monthly visitors, technological products, team members, events, articles (written by the companies), number of funding rounds and number of investors. Countries with a relatively low number of companies (fewer than 9) are not taken into consideration for this specific analysis.
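
The per-country averages with a minimum company count can be sketched as below. The field names (`country`, `revenue_usd`), the sample rows and the `country_averages` helper are hypothetical, and skipping missing values is an assumption; only the threshold of 9 companies comes from the text.

```python
from collections import defaultdict
from statistics import mean

MIN_COMPANIES = 9  # countries with fewer companies are left out of the analysis


def country_averages(rows, metric, min_companies=MIN_COMPANIES):
    """Average `metric` per country, skipping small countries and missing values."""
    by_country = defaultdict(list)
    for row in rows:
        by_country[row["country"]].append(row.get(metric))
    result = {}
    for country, values in by_country.items():
        if len(values) < min_companies:
            continue  # not enough companies to be representative
        present = [v for v in values if v is not None]
        if present:
            result[country] = mean(present)
    return result


# Hypothetical data: country X has 9 companies, country Y only 3.
rows = (
    [{"country": "X", "revenue_usd": 10.0}] * 9
    + [{"country": "Y", "revenue_usd": 99.0}] * 3
)
averages = country_averages(rows, "revenue_usd")
```

Country Y is dropped even though its average would be the highest, which mirrors why categories or countries with very few companies are treated separately in this report.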

Finland ($266M USD), India ($258M USD) and the United States ($245M USD) lead in average revenue per company.

With regard to average monthly visitors, Indonesia takes the lead with 48M visits. The number of companies in Indonesia is relatively small (9), so it makes sense to zoom in on the most visited company: Bukalapak, an e-commerce company with 82M monthly visits.

The use of technological products inside these companies is vital to guarantee a competitive advantage. Indonesia has the most, and the next country, with a relatively high number of companies (81), is India. India is known worldwide for its computational capabilities, and proof of that is the number of technological products each of its companies uses on average: 28.

Belgium and Israel lead in average team members (10 and 9, respectively), and Switzerland and the US in number of board members (6 and 4).

On average, the countries whose companies attend the most events are the United States (6), Sweden (6), France (6) and Belgium (6). This insight can serve as an input to tune the worldwide events participation strategy towards these countries.

Finally, the countries with the most investors, on average, are Sweden (14), Singapore (11) and United States (9).

Location - PayU Regions

On average, India leads in annual revenue with $258.7M USD, followed by USA and Canada with $238.8M USD and Asia with $98.4M USD. Not surprisingly, USA/CA and India lead in average monthly visitors among all regions (18.6M and 9.7M, respectively).

India and Asia lead in average funding rounds (6 and 4.4), while India, Brazil and USA/CA lead in average investors per company (9).


Funding

54.8% of companies are part of PayU’s main focus, as they are passing through funding stages (Early and Late Stage Venture). Merged and acquired (M&A) companies make up 36.1% of the database.

Public companies (IPO), those listed on a stock exchange, represent 1.74% of the data.

Since the definition of “High Velocity Merchants” is still being tested, we decided to include all companies in the analysis.

Private Equity companies are those that have not passed through an investment stage, so they are financed by their own equity and debt. They range from small and medium businesses to big private companies. The average number of investors in this type of company is 7. Even though they represent 1.26% of the companies in the database, they account for 34% ($736M USD) of the funding amount mapped.


Team

53% of companies in the database have between 11 and 200 employees and 14% have between 201 and 5,000 employees. Very few (44) have more than 5,000 employees, and most of them are in the United States.

The difference between Employees and Team Members is that the latter refers to the leadership team (VPs, managers, C-level employees).

Companies with 501 to 1,000 employees attend the most events (13 per company, on average), while companies with 1,001 to 5,000 employees publish the most articles, on average.


Contact Information

This bar chart shows how complete the contact information in the database is, since not all of this data is public or centralized.


Founders



Founders describe themselves in terms of their job role (like “CEO”, “CTO”, “entrepreneur” or “investor”), their work approach (work, design, technolog, develop) and their academic experience (studi, stanford university, University California).


Top Influence Founder

This bubble chart represents the founders with the most connections on Angel.co and the most followers on Twitter. To keep the visualization readable, it only shows founders with more than 600 connections.

  • The color of the bubble represents the number of companies a founder has, according to Angel.co.
  • The size of the bubble represents a weighted indicator that combines Twitter followers (30%) and Angel.co connections (70%). This distribution gives more weight to Angel.co because this data source is considered to have less noise with regard to personal connections, and professionals use it to find jobs. Besides, according to Sparktoro.com, between 5 and 30% of Twitter followers are fake: they are bots, spam accounts, inactive users, propaganda, or other non-engaged or non-real users.
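
A minimal sketch of such a weighted indicator follows. The 30/70 weights and the 600-connection filter come from the text; the min-max scaling of each metric, the order of filtering and scaling, the field names and the sample founders are assumptions made for illustration, since the report does not describe how the two metrics were put on a common scale.

```python
def min_max(values):
    """Scale values to the range [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def influence_scores(founders, w_twitter=0.3, w_angel=0.7, min_connections=600):
    """Weighted score per founder, computed only for founders shown in the chart."""
    shown = [f for f in founders if f["angel_connections"] > min_connections]
    if not shown:
        return {}
    twitter = min_max([f["twitter_followers"] for f in shown])
    angel = min_max([f["angel_connections"] for f in shown])
    return {
        f["name"]: w_twitter * t + w_angel * a
        for f, t, a in zip(shown, twitter, angel)
    }


# Hypothetical founders, not taken from the database.
founders = [
    {"name": "founder_a", "twitter_followers": 1000, "angel_connections": 2000},
    {"name": "founder_b", "twitter_followers": 50000, "angel_connections": 700},
    {"name": "founder_c", "twitter_followers": 900000, "angel_connections": 300},
]
scores = influence_scores(founders)
```

With these weights, `founder_a` outranks `founder_b` despite having far fewer Twitter followers, and `founder_c` is not scored at all because of the connection filter.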

Of the 5,540 founders in this study, Angel.co shows 444 who own more than one merchant. David Gutelius, David Cancel and Harj Taggar are among the most influential founders with more than one company. Regardless of the number of companies, Paolo Privieta, Richard Titus, Micah Badwin and Danielle Morill would be the most influential founders according to the data.


Contact Information

Founders’ personal contact details, such as email or mobile number, are usually hidden on their social media profiles. LinkedIn is the channel with the most structured information and the most business-oriented use, so getting in touch through Sales Navigator (LinkedIn) is highly recommended.


Venture Capitalists



Venture capitalists describe themselves in relation to investment (as venture capitalists) and to the kind of enterprise they are looking for (startup, early stage). They do not show their academic or work background, nor the origin of their funds.


Location

The USA, India, China, Russia and the UK account for 60% of venture capitalists. Most of them are in the USA (33%).


Employees

76% of venture capital firms in the database have between 1 and 10 employees. Very few (7) have more than 5,000 employees, and most of them are in the United States.


Events & Institutions

Events happening in the next months:


Events’ Locations

The USA, Canada, China and India host 3,809 events (48% of all global events mapped).


Events’ Dates

Because all events were gathered during July and August of 2019, most of the events fall around those dates. Over the following year, April is the peak month, with 167 events.


Events’ Verticals

The industries with more events are HR, Jobs & Career, Antiques & Philately, Veterinary, Aerospace & Telecommunication. The events with the most visitors are Gifts & Gifting, Fashion & Beauty, Architecture & Designing, IT & Technology and Business Services.


Visitors by Verticals & Regions

In all regions, Direct Selling is the category with the most visitors. Travel is the second most visited category in Asia and India, and Agtech & Food is the second most visited category in EMEA and SSC.


Events and Companies crosstab

This is a cross-tabulation of companies and events using verticals and countries as common columns. For each category, the top 3 events with the most visitors and the top 3 companies with the most funding rounds were considered.
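
One way to build such a table is sketched below, assuming companies and events are records sharing `vertical` and `country` fields. All field names, sample rows and the `top_n` and `crosstab` helpers are hypothetical; only the "top 3 by visitors" and "top 3 by funding rounds" rules come from the text.

```python
from collections import defaultdict


def top_n(rows, key_fields, sort_field, n=3):
    """Group rows by key_fields and keep the n rows with the largest sort_field."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in key_fields)].append(row)
    return {
        key: sorted(members, key=lambda r: r[sort_field], reverse=True)[:n]
        for key, members in groups.items()
    }


def crosstab(companies, events, key_fields=("vertical", "country")):
    """Pair top companies and top events that share the same vertical and country."""
    top_companies = top_n(companies, key_fields, "funding_rounds")
    top_events = top_n(events, key_fields, "visitors")
    shared_keys = top_companies.keys() & top_events.keys()
    return {
        key: {
            "companies": [c["name"] for c in top_companies[key]],
            "events": [e["name"] for e in top_events[key]],
        }
        for key in shared_keys
    }


# Hypothetical records for illustration.
companies = [
    {"name": f"company_{i}", "vertical": "Fintech", "country": "India", "funding_rounds": i}
    for i in range(1, 6)
]
events = [
    {"name": "event_a", "vertical": "Fintech", "country": "India", "visitors": 500},
    {"name": "event_b", "vertical": "Travel", "country": "India", "visitors": 900},
]
table = crosstab(companies, events)
```

Only combinations present on both sides appear in `table`, so the Travel event is dropped here because no sample company shares its vertical.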

Institutions

2,155 of institutions are in USA, Canada, India and China,


Companies’ content



A total of 1,353 Twitter accounts were scraped, cleaned, and analyzed. Most of the content generated by those companies is related to development, software, technology, money, businesses, sites and engagement, and part of it is related to women in tech as well.


Verticals’ social status

As the number of companies in Aerospace and Government is relatively low (10), we are not going to take them into consideration.

When the average number of followers on Twitter (Avg Followers) is ordered from highest to lowest, Software and Advertising are at the top, just as they are for average monthly visitors. However, when average annual USD revenue is considered, Digital Services, Travel and Fintech lead regardless of the Twitter metrics. This suggests that the number of followers is not strongly tied to annual revenue.

Regions’ social status

Although average annual USD revenue seems to increase with the average number of Twitter followers, India stands out as an atypical value: it has a significantly lower number of Twitter followers yet more revenue, on average, than USA and Canada (USA/CA). This may imply that Twitter is not as relevant in India as it is in North America and Asia.


Content Clustering

On content analysis, two Natural Language Processing techniques were used: t-SNE (for content clustering) and LDA (for topic modelling).

After iterating over several candidate numbers of topics, 16 topics were selected because they gave a relatively high coherence metric and a broad enough set of topics to take content from.

Plot

This chart is a visualization produced with the t-SNE technique, which reduces the dimensionality of the data (a large number of accounts, words and topics) into a single 2D projection with all available content and the latent clusters highlighted.

Each dot is a company’s account and the color encodes its latent topic. When the user hovers the mouse over a company, the tooltip shows the topic, the category, the Twitter username and a 50-word sample of the content the company generates.

This visualization rests on one relevant assumption: each company is assigned the topic to which it most probably belongs. Because of that, some topics are not shown, as their probability is never higher than that of the most relevant ones. Also, every company is assigned to exactly one topic; in reality, however, a company is expected to talk about multiple topics.
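
The assignment rule behind this assumption can be sketched as follows, assuming each account comes with a list of topic probabilities. The account names, probabilities and the `dominant_topic` helper are hypothetical values for illustration.

```python
def dominant_topic(topic_probs):
    """Return the 1-based index of the most probable topic for one account."""
    best = max(range(len(topic_probs)), key=lambda i: topic_probs[i])
    return best + 1


# Hypothetical topic distributions for three accounts over three topics.
accounts = {
    "@account1": [0.6, 0.3, 0.1],
    "@account2": [0.5, 0.1, 0.4],
    "@account3": [0.2, 0.1, 0.7],
}

assigned = {name: dominant_topic(probs) for name, probs in accounts.items()}
shown_topics = set(assigned.values())
hidden_topics = set(range(1, 4)) - shown_topics
```

Topic 2 never wins for any account, so it does not get a color in the plot even though every account talks about it to some degree. The same effect explains why some of the 16 topics have no cluster.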

Clusters descriptive analysis

  • Cluster 1 (pink) is mostly related to the health industry. Its companies’ content contains words related to doctors, health and cancer, and some of those companies are in the biotechnology industry.
  • Cluster 2 (purple) contains a single company, Tiki (@tikivn). It is “Vietnam’s fastest and most trusted B2C e-commerce platform, across all categories” (taken from their Twitter account), and most of its content is unrelated to the rest of the content, so it stands as a cluster on its own.
  • Cluster 4 (red) contains most of the Spanish- and Portuguese-speaking companies, as their content is in those languages.
  • Cluster 7 (orange) groups mostly English-speaking companies in the Software, Digital Services, Content and Advertising industries.
  • Cluster 8 (beige) is the second biggest cluster and contains information on several topics, verticals and accounts: on the right there is content related to entertainment and food; at the center there is information related to health, fitness and technology; and on the left there is content related to events and advertising.
  • Cluster 9 (green) is similar to Cluster 2 with regard to content language: it generates most of its content in Italian and it is based in Milan. @YouApplix: “Applix is a company focused on Customer Interaction through mobile solutions that enable brands, publishers and institutions to create their digital strategy.”
  • Cluster 12 (blue) is the biggest cluster. It contains most of the companies and their content is related to technology, entrepreneurship, Artificial Intelligence, team, digital, contact, support, customer, apps and solutions.
  • Cluster 13 (light green) also contains information related to digital services and is relatively similar to Cluster 12. However, it goes deeper into security, privacy and protection of information, as its content (and companies) are more related to those subjects.
  • Cluster 15 (light blue) groups tech companies and Asian companies, which stand out in its content and number. It has tech companies from the US (@gamemix), Indian companies (@udaandotcom) and Chinese companies (@DXYinfo and @wenzhihu).
  • Clusters 3, 5, 6, 10, 11, 14 and 16 do not appear because none of them was the most probable topic for any company.

As this tool mainly helps visualize how clusters form among companies, the content insights will be taken mostly from LDA’s Topic Modelling.


Topic Modelling

Latent Dirichlet Allocation (LDA) is a technique for Topic Modelling that considers:

  • Each account’s content is a weighted combination of several topics: @account1 = 0.1 (Topic1) + 0.05 (Topic2) + 0.03 (Topic3) + ...
  • Each topic has its grouped representative keywords, each with its own relevance (probability): Topic 1 = 0.3 (learn) + 0.2 (join) + 0.1 (company) + ...
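
The two levels of weights described above combine into the probability of seeing a word in an account's content: the sum, over topics, of the account's weight for the topic times the topic's weight for the word. The sketch below uses hypothetical weights and a hypothetical `word_probability` helper; it illustrates the mixture idea and is not output from the fitted model.

```python
# Hypothetical weights: p(topic | account) and p(word | topic).
account_topics = {"topic1": 0.7, "topic2": 0.3}
topic_words = {
    "topic1": {"learn": 0.5, "join": 0.3, "company": 0.2},
    "topic2": {"company": 0.6, "shop": 0.4},
}


def word_probability(word, account_topics, topic_words):
    """p(word | account) = sum over topics of p(topic | account) * p(word | topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in account_topics.items()
    )


p_company = word_probability("company", account_topics, topic_words)

# Summed over the whole vocabulary, the word probabilities add up to 1.
vocabulary = {w for words in topic_words.values() for w in words}
total = sum(word_probability(w, account_topics, topic_words) for w in vocabulary)
```

Here “company” gets probability 0.7 × 0.2 + 0.3 × 0.6 = 0.32, because both topics contribute to it.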

This view is divided into two sections:

Left module

There is a Cartesian map showing the topics that emerged from all documents. The buttons on the top left help navigate through every topic. The size of each bubble (topic) represents its marginal share among all documents. For example, a margin of 10% would imply that this topic captures 10% of all content mapped.

The PC1 and PC2 axes locate the topics by calculating their principal components. It is a way to represent each topic, as defined by its most salient terms, on a two-dimensional scale that is easy to visualize. As the biggest topics (1, 2, 3 and 4) are located on the right side of the plane, most of the relevant content is located in that area.

Use the control on the top left to navigate between all 16 topics.

Right module

The module on the right shows the most salient terms for each topic and among all documents. By default, the most relevant words are shown for all content. Hover the mouse over a topic on the left to show its top 30 most salient words on the right, along with the topic’s share of the whole content (% of tokens).

The control panel on the top right sets the relevance metric (\(\lambda\)), which controls how words (tokens) are ranked for the selected topic. With \(\lambda\) = 1, words are ranked by how frequent they are within the topic, which favours words that are also common across all documents. With \(\lambda\) close to 0, the highlighted words are those associated with that specific topic rather than with all content. Studies suggest a \(\lambda\) of about 0.6, yet the best value depends on the origin of the data. We will consider three situations: \(\lambda\) = 0, \(\lambda\) = 0.6 and \(\lambda\) = 1.
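
The effect of \(\lambda\) can be sketched with the relevance definition used by LDAvis-style views, \(\lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}\). The report does not name the visualization tool, so treating its slider as this formula is an assumption, and the term probabilities below are hypothetical.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """Blend in-topic probability (lam = 1) with lift over the whole corpus (lam = 0)."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical probabilities for two terms within one topic.
terms = {
    "data": {"p_w_t": 0.05, "p_w": 0.05},      # frequent in every topic
    "malware": {"p_w_t": 0.02, "p_w": 0.001},  # rare overall, concentrated in this topic
}


def rank(terms, lam):
    """Order terms from most to least relevant for a given lambda."""
    return sorted(
        terms,
        key=lambda w: relevance(terms[w]["p_w_t"], terms[w]["p_w"], lam),
        reverse=True,
    )


order_by_lambda = {lam: rank(terms, lam) for lam in (0.0, 0.6, 1.0)}
```

With \(\lambda\) = 1 the generic word “data” ranks first, while with \(\lambda\) = 0 or 0.6 the topic-specific word “malware” does, which is why lower values make topics easier to tell apart.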


Main topics analysis

Every topic reveals insightful information about what companies are talking about on social media. Certain topics even reveal multiple categories within the same topic.

Topic 1 (43% of tokens)
  • Women in tech: women, women in tech, technology, tech, software engine, business, product, techstar, manage, leader, quantum (Software company), data.
  • Events: booth, conference, announce, innovation, deadline, panel, speak, learn, week, share, meet, talk, discus, data.
  • Coworking: worker, office, workplace, future, work, team, hire, employee, liquidspace (coworking).
  • Fintech: workplace, ipo (Initial Public Offering), corporate, lend, invest, worker, equity, debt, finance, bank.
  • Real Estate: workplace, ipo (Initial Public Offering), corporate, worker, equity, employee engagement, software engine, mortage, broker, multifamily, build, loan, advisor, ar_vr (Augmented Reality and Virtual Reality).
  • Automobile: automot, advisor, corporate, booth, conference, panel, booth.
  • Cryptocurrency: cryptocurrency, technology, software engine, corporate, equity, company, startup, ceo, business, product, data.
  • Cannabis: cannabis, technology, corporate, equity, company, startup, ceo, business, product, data.
Topic 2 (27% of tokens)
  • Style: wear, flower, color, decoration, accesories, dress, shoe, hair, makeup, beauty, jewerly, gorgeous, design, woman, stylist, stylish, love, shop, check, deal, find, day.
  • Entertainment: wine, remix, vacation, dance, outdoor, honeymoon, summer, travel, music, fun, holiday, relax, gift, photo, video, weekend, giveaway, spring, adventure, season, friend, book,
  • Fitness: workout, gym, wear, yoga, dance, outdoor, beauty, perfect, wear, save, live, start, life, try, day
  • Food: kitchen, meal, delicious, restaurant, taste, favorite, shop, free, good, love, happy, time, deal, day.
Topic 3 (13.4% of tokens)
  • Customer Service: email address, sorry hear, kindly, patience, question, contact, sorry trouble, sorry inconvenience, sorry delay, experience issue, apology inconvenience, refund, feel free, please, email, support, issue, sorry, try, help, happy.
  • Apps (software) experience: send dm (direct message), android, html, app, sdk, appstore, shot dm, web, io (domain extension popular among tech companies), username, update, googleplay, Nuzzel (News Intelligence site), app, check, team.
  • Gaming: gamedev, GDC (Game Developers Conference), multiplay, hologram, arcade, game, ninja, twitch (live streaming platform for gamers), reward, well, dm (direct messaging), happy, check, team.
Topic 4 (9.1% of tokens)
  • Retail: ecommerce, omnichannel, nrf (Retail’s Big Show and Expo), shoporg, retaild, cpg (Consumer Packaged Goods), mcommerce (mobile commerce), magento (open source e-commerce platform), retail, shopper, top, market, sale, consume, online, revenue.
  • Marketing: market, email marketing, programmatic, digital mark, influencer marketing, media post, martech (marketing technology), customer experience, iab (Interactive Advertising Bureau), Ad Exchange, digiday, content marketing, ott (over-the-top), ugc (User Generated Content), scoop it (Software content marketing company), bizibl (business content company), shoptalk, loyalty, marketing automation, brand, advertising, ad, influence(r), campaign, social medium, audience, data, sale, consume, online, revenue.
Topic 5 (3.1% of tokens)
  • Health: health, healthcar, medic, digitalhealth, alzheim, caregiver, physician, doctor, clinic, healthit, mentalhealth, disease, healthtech, ehr (Electronic Health Record), infect, pediatr, hospit, healthit, nurs, risk, dental.
  • Pharmaceutical: patient, drug, diabet, fda (U.S. Food and Drug Administration), himss (Healthcare Information and Management Systems Society), medicin, cancer, vaccin, diagnosi, biotech, pharma, bacteria, diabet, dementia, gsk (GlaxoSmithKline pharmaceutical), antiobiot, chronic diseases, prediabet, rare diseases, fda approval, hypertension.
  • Sustainability: greenbiz (company), uganda, edison award, improv, prevent.
Topic 6 (2.4% of tokens)
  • Cyber security: vulnerability, breach, malware, random ware, microserv, account take over, data breach, phishing, firewall, ransomwar_attack, security, devops (development and operations), encrypt, threat, cyberattack, protect, bot, attack, analyt, vulnerable, kubernet, cio, monitor, backup
  • Data processing: mdm (Master Data Management), Kubernetes, data center, microserv, vm (Virtual Machine), hadoop (software platform), msp (Managed Service Provider), dockercon (software conference), docker (software), multicloud, openstack, microsoft azure, devop, enterprise, deploy, storag, bot, bigdata, api, automation, monitor, application, demo
  • Tech news: dzone (tech news platform), infosec, crn (tech news), thenewstack, networkworld, iaa (International Advertising Association), sdtime, csoonline (news), analytics, webinar, gartner, vendor, booth
Topic 7 (1.5% of tokens)
  • Teen education: edchat, edchat_edtech, edutopia, edsurg, verb, blendedlearn, educationweek, teacher, student, school, math, classroom, colleg, educ, grade, textbook, school_district, teen, kid, lesson, parent, textbook, grader, grade.
  • Tecnología (technology in Spanish and Portuguese): tecnologi, empresa, compra, latam, hacer, semana, nuestra, vida.

Only the first 7 topics are considered since, from that point on, the percentage of content (tokens) explained by each topic gets significantly low (< 1%).


Next Steps

Mapping the HVMs’ ecosystem provides input to define strategies within PayU’s core business:

  1. Guides marketing and communication campaigns towards the languages, markets, events and industries that best match the target companies.
  2. Sets the parameters for events participation and for profiling PayU’s speakers in each vertical, and suggests direct meeting agendas with founders, VCs and institutions at each event.
  3. Highlights the main conversation topics among HVMs so that communication can be fluid and tracked in media to improve Public Relations and Communications actions.
  4. Provides insight from the main conversation topics so that digital marketing content accurately reflects the HVMs’ interests.
  5. Identifies founders, venture capitalists and institutions that are top influencers in the target ecosystem, making them desirable allies to work with.
  6. Boosts the commercial teams’ lead database, segmented by venture stage, industry or geographic origin.